Effect of Dirty Data on Analysis Results

Authors

  • Dominique Haughton
  • Mary Ann Robbert
  • Linda P. Senne
  • Vismay Gada
Abstract

Information quality assessment is the process of inspecting business information to ensure that it meets the needs of the knowledge workers who depend on it. We suggest in this paper that, prior to implementing a system to assess quality, those responsible for information quality can use a subset of clean data to create a statistical model of a decision that relies on the information. Simulated perturbations of the clean data can then be used to establish a boundary for determining what degree of error produces erratic, unusable results. This approach has the advantage that it can be used to show the effects of poor data quality on the result of the analysis as accuracy declines for any reason. We do not focus on the reasons why the quality declines but rather show the consequences of poor-quality data on the results of the analysis. Moreover, we examine a general error structure, one that is common in situations where errors are not additive, and is more general than that previously considered in the literature,
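The perturbation idea described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: the abstract does not name the statistical model or the exact error structure, so it assumes a simple linear regression as the "decision model" and a multiplicative log-normal error as one example of a non-additive error. All function names, the true coefficients (intercept 2, slope 3), and the error levels are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def perturb(x, sigma, rng):
    """Assumed non-additive error: each value is multiplied by exp(N(0, sigma^2))."""
    return [v * math.exp(rng.gauss(0.0, sigma)) for v in x]


def slope_under_error(sigmas, n=1000, reps=50, seed=0):
    """For each error level, refit the model on perturbed copies of the clean
    predictor and return (mean slope, std. dev. of slope) across repetitions."""
    rng = random.Random(seed)
    # Hypothetical "clean" data: y = 2 + 3x + noise.
    x = [rng.uniform(1.0, 10.0) for _ in range(n)]
    y = [2.0 + 3.0 * v + rng.gauss(0.0, 1.0) for v in x]
    results = {}
    for sigma in sigmas:
        slopes = [ols_slope(perturb(x, sigma, rng), y) for _ in range(reps)]
        results[sigma] = (statistics.fmean(slopes), statistics.pstdev(slopes))
    return results


if __name__ == "__main__":
    for sigma, (mean, sd) in slope_under_error([0.0, 0.1, 0.3, 0.6]).items():
        print(f"sigma={sigma:.1f}  mean slope={mean:.3f}  sd={sd:.3f}")
```

Under these assumptions, the estimated slope should stay close to 3 when no error is injected and drift away from it as sigma grows. Reading off the error level at which the estimate becomes unusable for the decision at hand is one way to set the kind of boundary the abstract refers to.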


Similar resources

Moving up the Energy Ladder: The Effect of an Increase in Economic Well-Being on the Fuel Consumption Choices of the Poor in India

While the wealth effect from a capital influx would induce households to buy more dirty fuel, they might simultaneously reduce their purchases depending on the degree of substitutability with clean fuels and other consumption. This substitution effect will not necessarily dominate the wealth effect, unless dirty fuels are an inferior good. However, a capital increase could also cause labor to b...

Impacts of Dirty Data: an Experimental Evaluation

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied on the selection of the appropriate algorithm with the consideration of data quality and the determination of the data share to clean. However, rare research has focused o...

Psychometric Properties of the Persian Version of the Dirty Dozen Questionnaire

Objectives: Dark triad is a new formulation of maladaptive personality that is composed of Machiavellianism, subclinical narcissism, and subclinical psychopathy. The aim of the current research was to study the psychometric properties of the short form of Dirty Dozen Scale among Iranian population. Method: In this cross sectional study, 300 university students in 2014-15 academic year were sele...

Effect of Dirty Data on Free Text Discharge Diagnoses used for Automated ICD-9-CM Coding

We discuss data quality issues that emerge when applying text mining classification methods for automated ICD-9-CM coding. In particular our work investigates the extent to which errors in input text data propagate to the classification model. Text classification techniques based on two Bayesian machine learning algorithms (naive Bayes and shrinkage) were applied to a set of free-text outcome d...

Systematic review and meta-analysis of randomized clinical trials comparing primary vs delayed primary skin closure in contaminated and dirty abdominal incisions.

IMPORTANCE Surgical site infection remains a major challenge in surgery. Delayed primary closure of dirty wounds is widely practiced in war surgery; we present a meta-analysis of evidence to help guide application of the technique in wider context. OBJECTIVE To determine using meta-analysis whether delayed primary skin closure (DPC) of contaminated and dirty abdominal incisions reduces the ra...

Publication date: 2003